Remove bed_reader #400

benjeffery · 2025-05-22T13:54:15Z

Fixes #397

coveralls · 2025-05-22T13:59:03Z

coverage: 98.212% (+0.03%) from 98.182%
when pulling 08ceb74 on benjeffery:remove_bed_reader
into b87be4c on sgkit-dev:main.

benjeffery · 2025-05-22T14:00:33Z

Plink tests run somewhat faster with this change! 10% or so

jeromekelleher · 2025-05-22T15:13:44Z

Haha, well, there you go! We'll need to modularise and test more, but I think that'll be quicker than dealing with all the details about packaging.

One quick question: is there a reason you didn't use pandas for the text files? I think this would be worthwhile, and it would mean that we can port over the sgkit conversion functions like read_fam. I'm happy with pandas as a dependency; it's pretty much universal now.

jeromekelleher · 2025-05-22T15:15:02Z

@tomwhite - do you see any issues with dropping bed_reader and using the lookup table approach here for decoding bed? I see a lot of advantages...
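For readers unfamiliar with the lookup table approach mentioned here, the following is a minimal sketch of the general idea, not the code in this PR. It assumes the PLINK 1 variant-major BED layout (four 2-bit genotype codes per byte, low-order bits first) and decodes to A1 allele counts with -1 for missing; the names `CODE_TO_A1_COUNT`, `build_lookup` and `decode_variant` are hypothetical.

```python
import numpy as np

# PLINK 1 BED 2-bit codes mapped to A1 allele counts (an assumed output
# convention): 0b00 = homozygous A1, 0b01 = missing, 0b10 = heterozygous,
# 0b11 = homozygous A2.
CODE_TO_A1_COUNT = np.array([2, -1, 1, 0], dtype=np.int8)


def build_lookup():
    """Return a (256, 4) table: row b holds the four calls packed in byte b."""
    byte_values = np.arange(256, dtype=np.uint8)
    # Sample k within a byte occupies bits 2k and 2k+1 (low-order bits first).
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (byte_values[:, None] >> shifts[None, :]) & 0b11
    return CODE_TO_A1_COUNT[codes]


LOOKUP = build_lookup()


def decode_variant(packed, num_samples):
    """Decode the packed bytes of one variant into per-sample A1 counts."""
    raw = np.frombuffer(packed, dtype=np.uint8)
    # One fancy-indexing step expands every byte to four calls; the slice
    # drops the padding entries in the final byte.
    return LOOKUP[raw].reshape(-1)[:num_samples]
```

The table is built once, so decoding a variant is a single indexing operation with no per-sample bit manipulation in Python.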

tomwhite · 2025-05-22T15:37:23Z

> @tomwhite - do you see any issues with dropping bed_reader and using the lookup table approach here for decoding bed? I see a lot of advantages...

I'm pleased to see this. It's not like bed_reader supports PLINK 2, so doing this wouldn't cut off a future migration path. We probably want more tests for corner cases that bed_reader supports.

Can you do the same thing for BGEN? 😄

benjeffery · 2025-05-22T15:39:03Z

> is there a reason you didn't use pandas for the text files

Didn't want to add a dependency while removing one! If you're happy with pandas then happy to switch to it.
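As a rough illustration of what a pandas-based reader for the text files could look like, here is a sketch for a FAM file. It is not the sgkit `read_fam` implementation: the function name `read_fam_sketch` and the column names are assumptions, and only the standard six-column, whitespace-delimited FAM layout is taken as given.

```python
import io

import pandas as pd

# Standard six FAM columns; these particular names are illustrative only.
FAM_COLUMNS = [
    "family_id",
    "individual_id",
    "paternal_id",
    "maternal_id",
    "sex",
    "phenotype",
]


def read_fam_sketch(source):
    """Read a whitespace-delimited FAM file (path or file object)."""
    return pd.read_csv(
        source,
        sep=r"\s+",
        header=None,
        names=FAM_COLUMNS,
        dtype=str,  # keep IDs such as "0" or "007" as text
        keep_default_na=False,  # do not turn IDs like "NA" into NaN
    )


example = io.StringIO("F1 S1 0 0 1 -9\nF1 S2 S1 0 2 -9\n")
fam = read_fam_sketch(example)
```

Reading everything as strings defers interpretation of sentinel values such as `0` and `-9` to the caller.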

jeromekelleher · 2025-05-22T15:49:52Z

> Can you do the same thing for BGEN?

No - it's too complicated for this kind of treatment unfortunately

jeromekelleher · 2025-05-22T15:51:55Z

I'm happy to commit an initial version of this that just removes the bed_reader dep @benjeffery, and we can log issues for porting in the sgkit auxiliary file reading code and adding more tests for the data reading.

benjeffery · 2025-05-22T16:36:11Z

While writing this, I noticed that we're not storing the plink "family ID" (or parent ids) anywhere. Should those be included in the zarr?

jeromekelleher · 2025-05-22T16:55:41Z

Leaving them out for now, there's an issue open to track

jeromekelleher · 2025-05-22T19:37:05Z

I'll follow up on this here @benjeffery, I'm going to tack on some commits to move in the sgkit parsers and tests and will modularise the reader.

jeromekelleher · 2025-05-22T21:31:58Z

I've added some tests that generate a bunch of BED files using bed_reader, and I think it's looking solid. I haven't looked at performance at all - will need to source a big plink tomorrow and try it out.

tomwhite

Looks great!

benjeffery

Very nice, thanks for tidying up my rushed prototype!

benjeffery · 2025-05-23T09:29:34Z

bio2zarr/plink.py

+            if magic != b"\x6c\x1b\x01":
+                raise ValueError("Invalid BED file magic bytes")
+
+        # We could check the size of the bed file here, but that would


Good point about streams - I guess there is no way to know if the user has inconsistent bim/fam/bed files, some combinations of which would just give silently corrupted data.

Yeah, that's just how it is in the real world. There's no point in ruling out useful functionality just to do checking that other people don't bother with anyway
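For context on the check being discussed, a sketch of what the magic-byte test and an optional size check could look like is below. It assumes the variant-major PLINK 1 layout, where each variant occupies ceil(num_samples / 4) bytes after the 3-byte header; the helper names are hypothetical, and the size comparison only applies to regular files, since a stream has no known length up front.

```python
import io

BED_MAGIC = b"\x6c\x1b\x01"


def expected_bed_size(num_samples, num_variants):
    """Size in bytes of a variant-major BED file with these dimensions."""
    bytes_per_variant = (num_samples + 3) // 4  # four samples per byte
    return len(BED_MAGIC) + num_variants * bytes_per_variant


def check_bed_header(stream):
    """Consume and validate the 3-byte header; works on any binary stream."""
    magic = stream.read(len(BED_MAGIC))
    if magic != BED_MAGIC:
        raise ValueError("Invalid BED file magic bytes")
```

A caller holding a real path could compare `os.path.getsize(path)` with `expected_bed_size(...)` using the counts from the FAM and BIM files, but that catches only dimension mismatches, not files with swapped rows.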

Remove bed_reader

36def1b

benjeffery force-pushed the remove_bed_reader branch from a568103 to 36def1b Compare May 22, 2025 13:58

jeromekelleher added 5 commits May 22, 2025 20:38

Remove bed_reader/plink from CI tests

81443fe

Add read_bim function based on sgkit

7fe383a

Use the pandas-based FAM reader

ac9f1d9

Separate out the BedReader code

bf56b86

Remove bed_reader dep and add pandas

766dbd9

jeromekelleher force-pushed the remove_bed_reader branch from b5b56f6 to 5f580a5 Compare May 22, 2025 21:25

jeromekelleher marked this pull request as ready for review May 22, 2025 21:26

Test on generated bed files written by bed_reader

08ceb74

jeromekelleher force-pushed the remove_bed_reader branch from 5f580a5 to 08ceb74 Compare May 22, 2025 21:32

jeromekelleher requested a review from tomwhite May 22, 2025 21:32

tomwhite approved these changes May 23, 2025

benjeffery commented May 23, 2025

jeromekelleher added this pull request to the merge queue May 23, 2025

Merged via the queue into sgkit-dev:main with commit 6eb88ed May 23, 2025
15 checks passed

benjeffery deleted the remove_bed_reader branch May 23, 2025 10:55
